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Abstract 

A Bayesian approach to variable selection which is based on the expected Kullback- 
Leibler divergence between the full model and its projection onto a submodel has 
recently been suggested in the literature. Here we extend this idea by considering 
projections onto subspaces defined via some form of Li constraint on the parameter in 
the full model. This leads to Bayesian model selection approaches related to the lasso. 
In the posterior distribution of the projection there is positive probability that some 
components are exactly zero and the posterior distribution on the model space induced 
by the projection allows exploration of model uncertainty. We also consider use of the 
approach in structured variable selection problems such as ANOVA models where it is 
desired to incorporate main effects in the presence of interactions. Here we make use of 
projections related to the non-negative garotte which are able to respect the hierarchical 
constraints. We also prove a consistency result concerning the posterior distribution 
on the model induced by the projection, and show that for some projections related to 
the adaptive lasso and non-negative garotte the posterior distribution concentrates on 
the true model asymptotically. 
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1 Introduction 

Bayesian approaches to model selection and describing model uncertainty have become in- 
creasingly popular in recent years. In this paper we extend a method of variable selection 
considered by Dupuis and Robert (2003) and related to earher suggestions by Goutis and 
Robert (1998) and Mengersen and Robert (1996). Dupuis and Robert (2003) consider an 
approach to variable selection where models are selected according to a relative explanatory 
power, where relative explanatory power is defined using the expected KuUback-Leibler di- 
vergence between the full model and its projection onto a submodel. The goal is to find the 
most parsimonious model which achieves an acceptable loss of explanatory power compared 
to the full model. 

In this paper we consider an extension of the method of Dupuis and Robert (2003) where 
instead of considering a projection onto a subspace defined by a set of active covariates we 
consider projection onto a subspace defined by some other form of constraint on the param- 
eter in the full model. Certain choices of the constraint (an Li constraint as used in the 
lasso of Tibshirani (1996), for example) lead to exact zeros for some of the coefficients. The 
kinds of projections we consider also have computational advantages, in that parsimony is 
controlled by a single continuous parameter and we avoid the search over a large and complex 
model space. Searching the model space in traditional Bayesian model selection approaches 
with large numbers of covariates is a computationally daunting task. In contrast, with our 
method we handle model uncertainty in a continuous way through an encompassing model, 
and can exploit existing fast lasso type algorithms to calculate projections for samples from 
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the posterior distribution in the encompassing model. This allows sparsity and exploration 
of model uncertainty while preserving approximately posterior predictive behaviour based 
on the full model. Furthermore, the method is easy to implement given a combination of 
existing Bayesian software and software implementing lasso type fitting methods. While the 
method is more computationally intensive than calculating a solution path for a classical 
shrinkage approach like the lasso, this is the price to be paid for exploring model uncertainty 
and the computational demands of the method are certainly less than Bayesian approaches 
which search the model space directly. Our idea can also be applied to structured variable 
selection problems such as those arising in ANOVA models where we might wish to include 
interaction terms only in the presence of the corresponding main effects. A plug-in version of 
our approach is also related to the preconditioning method of Paul et al. (2007) for feature 
selection in "large p, small n" regression problems. A key advantage of our approach is sim- 
plicity in prior specification. One is only required to specify a prior on the parameter in the 
full model, and not a prior on the model space or a prior on parameters for every submodel. 
Nevertheless, the posterior distribution on the model space induced by the projection can 
be used in a similar way to the posterior distribution on the model space in a traditional 
Bayesian analysis for exploring model uncertainty and different interpretations of the data. 

There are many alternative Bayesian strategies for model selection and exploring model 
uncertainty to the one considered here. Bayes factors and Bayesian model averaging (Kass 
and Raftery, 1995, Hoeting et a/., 1999, Fernandez et a/., 2001) are the traditional approaches 
to addressing issues of model uncertainty in a Bayesian framework. As already mentioned, 
prior specification can be very demanding for these approaches, although general default prior 
specifications have been suggested (Berger and Pericchi, 1996; O'Hagan, 1995). In our later 
examples we focus on generalized linear models, and Raftery (1996) suggests some reference 
priors for Bayesian model comparison in this context. Structuring priors hierarchically and 
estimating hyperparameters in a data driven way is another way to reduce the complexity 
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of prior specification, and this can work well (George and Foster, 2000). Various Bayesian 
predictive criteria for selection have also been suggested (Laud and Ibrahim, 1995, Gelfand 
and Ghosh, 1998, Spiegelhalter et ai, 2002). These approaches generally do not require 
specification of a prior on the model - however, prior specification for parameters in all 
models is still required and this can be quite demanding if there are a large number of 
models to be compared. Decision theoretic strategies which attempt to take account of the 
costs of data collection for covariates have also been considered (Lindley, 1968, Brown et al, 
1999, Draper and Fouskakis, 2000). Bayesian model averaging can also be combined with 
model selection as in Brown et al. (2002). The projection method of Dupuis and Robert 
(2003) that we extend here is related to the Bayesian reference testing approach of Bernardo 
and Rueda (2002) and the predictive method of Vehtari and Lampinen (2004). There are also 
less formal approaches to Bayesian model comparison including posterior predictive checks 
(Gelman, Meng and Stern, 1996) which are targeted according to the uses that will be made 
of a model. We see one application of the methods we describe here as being to suggest a 
small set of candidate simplifications of the full model which can be examined by such means 
as to their adequacy for specific purposes. 

The projection methods we use here for model selection are related to the lasso of Tibshi- 
rani (1996) and its many later extensions. There has been some recent work on incorporating 
the lasso into Bayesian approaches to model selection. Tibshirani (1996) pointed out the 
Bayesian interpretation of the lasso as a posterior mode estimate in a model with indepen- 
dent double exponential priors on regression coefficients. Park and Casella (2008) consider 
the Bayesian lasso, where estimators other than the posterior mode are considered - their 
estimators do not provide automatic variable selection via the posterior mode but conve- 
nient computation and inference are possible within their framework. Yuan and Lin (2005) 
consider a hierarchical prior formulation in Bayesian model comparison and a certain analyt- 
ical approximation to posterior probabilities connecting the lasso with the Bayes estimate. 
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Recently Griffin and Brown (2007) fiave also considered alternatives to double exponential 
prior distributions on the coefficients to provide selection approaches related to the adaptive 
lasso of Zou (2006). 

The structure of the paper is as follows. In the next section we briefly review the method 
of Dupuis and Robert (2003) and consider our extension of their approach. Computational 
issues and predictive inference are considered in Section 3, and then a consistency result 
relating to the posterior distribution on model space induced by the projection is proved in 
Section 4. We describe applications to structured variable selection in Section 5 and Section 
6 considers connections between our method and the preconditioning approach to selection 
of Paul et al. (2007) in "large p, small n" regression problems. Section 7 considers some 
examples and simulation studies and Section 8 concludes. 

2 Projection approaches to model selection 
2.1 Method of Dupuis and Robert 

Dupuis and Robert (2003) consider a method of model selection based on the Kullback- 
Leibler divergence between the true model and its projection onto a submodel. Suppose 
we are considering a problem of variable selection in regression, where Mp denotes the 
full model including all covariates and Ms is a submodel with a reduced set of covariates. 
Write f{y\6F,MF) and f{y\6s, Ms) for the corresponding likelihoods of the models with 
parameters dp and Os- Let d's — d'siOp) be the projection of Op onto the submodel Ms- 
That is, d's is the value for 6s for which f{y\Os, Ms) is closest in KuUback-Leibler divergence 
to f{y\dF,MF), so that 
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Let 

5{Ms,Mp) = J j loglj^l^ f{x\dp,MF)dxp{dF\y)ddp 

be the posterior expected KuUback-Leibler divergence between the full model and its KuUback- 
Leibler projection onto the submodel Ms- The relative loss of explanatory power for Ms 
is 

5{Ms,Mf) 



d{Ms,MF) 



S{Mo,Mf) 

where Mq denotes the model with no covariates and Dupuis and Robert (2003) suggest model 
selection by choosing the subset model most parsimonious for which d{Ms, Mp) < c where 
c is an appropriately small constant. If there is more than one model of the minimal size 
satisfying the bound, then the one with the smallest value of S{Ms, Mp) is chosen. Dupuis 
and Robert (2003) show that 5{Mq,Mf) can be interpreted as measuring the explanatory 
power of the full model, and using an additivity property of projections they show that 
d{Ms, Mp) < c guarantees that our chosen submodel S has explanatory power at least 
100(1 — c)% of the explanatory power of the full model. This interpretation is helpful in 
choosing c. For a predictive quantity A we further suggest approximating the predictive 
density p{A\y) for a chosen subset model S by 

Ps{A\y) = J p{A\e's,y)p{e's\y)de's (1) 

where p{0'g\y) is the posterior distribution of 0'g under the posterior distribution oi dp for 
the full model. 



2.2 Extension of the method 

To be concrete suppose we are considering variable selection for generalized linear models. 
Write yi, y„ for the responses with E{yi) — /i^ and suppose that each yj has a distribution 
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from the exponential family 



/ ( Vi] ^u^] = exp ( . ^}^" +c\y., 



where 9i = 9i{^i) is the natural parameter, is a scale parameter, the Ai are known weights 
and h{-) and c(-) are known functions. For a smooth invertible link function g{-) we have 
rji = g{fii) = xjf3 where Xi is a p- vector of covariates and /3 is a p-dimensional parameter 
vector. Writing X for the design matrix with zth row xj and r] = [rji, ...,7]^)^ (the values 
of rj are called the linear predictor values) we have r) = Xf3. We write f{y',f3) for the 
likelihood. Now let f3 be fixed and suppose we wish to find for some subspace 5* of the 
parameter space the Kullback-Leibler projection onto S. The subspace S might be defined 
by a subset of "active" covariates as in Dupuis and Robert (2003) but here we consider 
subspaces such as 



S = S{\) = \f3:J2m<^\ (2) 



s = 5(/3*,a) = |/3:XJi/5.i/i/5;i<a| (3) 

where f3* is a parameter value that supplies weighting factors in the constraint or 

I i=i j=i J 

The choice (|2| leads to procedures related to the lasso of Tibshirani (1996), ^ relates to the 
adaptive lasso of Zou (2006) and ^ is related to the elastic net of Zou and Hastie (2005). 
In ([3]) we have allowed the space that we are projecting onto to depend on some parameter 
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f3* , and later we will allow f3* to be the parameter in the full model that we are projecting, 
so that the subspace that we are projecting onto is adapting with the parameter. There is 
no reason to forbid this in what follows. In a later section we also consider projections which 
are related to Breiman's (1995) non-negative garotte and which allow for structured variable 
selection in the presence of hierarchical relationships among predictors. The close connection 
between the adaptive lasso and the non-negative garotte is discussed in Zou (2006). 

Now let f3g be a parameter in the subspace S. In the development below we consider 
the scale parameter as known - we consider an unknown scale parameter later. The 
Kullback-Leibler divergence between f{y;f3) and f{y;f3g) is 



where Ep denotes the expectation with respect to f{y; f3). Writing /ii(/3) and 6'j(/3) for the 
mean and natural parameter for yi when the parameter is /3 ([5| is given by 



where f{fi{f3);f3g) is the likelihood evaluated at f3g with data y replaced by fitted means 
/Li(/3) when the parameter is f3 and C represents terms not depending on f3g and hence irrel- 
evant when minimizing over f3g. So minimization with respect to f3g subject to a constraint 
just corresponds to minimization of the negative log-likelihood subject to a constraint but 
with data i-i{f3) instead of y. Dupuis and Robert (2003) observed that in the case where 
the subspace S is defined by a set of active covariates calculation of the Kullback-Leibler 
projection can be done using standard software for calculation of the maximum likelihood 




(5) 




log fiM; (3s) + C 
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estimator in generalized linear models: one simply "fits to the fit" using the fitted values 
for the full model instead of the responses y in the fitting for a subset model. Clearly for 
some choices of the response distribution the data y might be integer valued, but commonly 
generalized linear modelling software does not check this condition so that replacement of 
the data y with the fitted means for the full model can usually be done. 

In our case, suppose we wish to calculate the projection onto the subspace ([2]). We must 
minimize 

p 

-log/(^(/3);/35) subject to ^ \(3s,j\ < A 

i=i 

where f3s,j denotes the jth element of f3g. This is equivalent to minimization of 

p 

-\ogf{^i{(3y,(3s) + Sj2\PsJ 

for some 6 > 0. Here the calculation just involves the use of the lasso of Tibshirani (1996) 
where in the calculation the responses are replaced by the fitted values i-i{f3). In the case 
of a Gaussian linear model, the whole solution path over values of 6 can be calculated with 
computational effort equivalent to a single least squares fit (Osborne et al., 2000, Efron et 
ai, 2004). Efficient algorithms are also available for generalized linear models (Park and 
Hastie, 2007). Note that because the relationship between A and 6 depends on /3, generally 
we calculate the whole solution path over 6 in order to calculate the projection onto the 
subscpace defined by the constraint. One can consider other constraints apart from an Li 
constraint. For instance, ^ leads to minimization of 

-log fif,if3y,f3s) + 6 Y,\/3s,j\m 

for 5 > if we choose f3* = f3 which gives a certain adaptive lasso estimator (Zou, 2006) 
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obtained by fitting with data y replaced by n{l3). Also, ^ leads to minimization of 

- log /(/.(/3); f3s) + 6j2 \l3s„j I + 7 ^ f^h 

for positive constants 6 and 7 which is related to the elastic net of Zou and Hastie (2005). 

Later we focus on the lasso and adaptive lasso type projections for which the parameter A 
needs to be chosen. One way to do this is to follow a similar strategy to the one employed in 
Dupuis and Robert (2003). Writing Ms = Ms{\) for the model subject to the restriction ^ 
or ([3]), we choose A as small as possible subject to d{Ms-, Mp) < c. Note that choosing the 
single parameter A is much easier than searching over subsets as in Dupuis and Robert (2003), 
and that d{Ms, Mp) increases monotonically as A decreases. An alternative to choosing c 
based on relative explanatory power would be to directly choose the observed sparsity in the 
model: that is, to choose A so that the posterior mean of the number of active components 
in the projection is equal to some specified value. The relative loss of explanatory power 
can also be reported for this choice. Another possibility is to avoid choosing A at all, but 
instead to simply report the characteristics of the models appearing on the solution path 
over different samples from the posterior distribution in the full model. 

2.3 Gaussian response with unknown variance 

So far we have considered the scale parameter to be known. For binomial and Poisson 
responses 0=1, but we also wish to consider Gaussian linear models with unknown variance 
= cr^. Calculation of projections is still straightforward in the Gaussian linear model with 
unknown variance parameter. Now suppose we have mean and variance parameter f3 and 
cr^, and write f3g and cr| for corresponding parameter values in some subspace S. In this 
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case ([5]) is 



\ i=l j=l 

1 " 

= -| log27ra2 - I + I log27r4 + ^ ^^(a^ + (/.,(/3) - /i.(/3,))2) 
Minimization with respect to subject to an Li constraint involves minimization of 

n p 

i=i j=i 

and the minimizer f3'g over f3g is independent of the value of Once the projection f3'g is 
calculated, the projection aV is easily shown from the expression above to be 



as =(r 



n 



3 Computation and predictive inference 

We have already discussed computation of the Kullback-Leibler projection onto subspaces of 
certain forms in generalized linear models. Hence generating from the posterior distribution 
of the projection is easily done - we simply generate a sample from the posterior distribution 
p{f3\y) of the parameter, f3^^\ fB^'^^ say, and then for each of these parameter values we 
calculate the corresponding projections /3^^-* , ■■.,f3^^'^ . Note that the pattern of sparsity of 
the projection is different for different samples from the posterior distribution, so that the 
posterior distribution of the projection provides one way of exploring model uncertainty. We 



11 



approximate the predictive density ([T| by 



-^p(A|/3«',?/) 



where p(A|/3'-*^ , y) denotes the predictive distribution for A given the parameter value /3^*^ 
and data y. We can write 7^- = J(/5^- 7^ 0) for the indicator of whether or not the jth 



component of the projection of /3 is nonzero, and 7' = (7^, 7')^. Then we can write 



These expressions for predictive densities are formally similar to those arising in Bayesian 
model averaging, where different values for the indicators 7' define different models. Of 
course, the posterior distribution on 7' cannot be interpreted in quite the same way as the 
posterior distribution on the model space in a formal Bayesian approach to model com- 
parison, but we still believe that examining p(7'|2/) can be helpful for exploring different 
interpretations of the data in our approach. We illustrate this in the examples below. 

4 Consistent model selection 

Let /3° denote the true parameter, and for any /3 write = {A; : /3fc 7^ 0} so that for 

instance ^(/J") is the set of nonzero coefficients for the true parameter. Suppose that /3 is 



P5(A|y) in a different form to ([T]), namely 



P5(A|?/) = 5^p(7'|?/MA|7',?/) 
7' 



where 
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some fixed parameter value and consider which minimizes 

p 

- logp{ti{f3);f3s) subject to ^ \f3s,j\/\f3,\ < A. (6) 

i=i 

with respect to f3g. That is, we consider in this section Kullback-Leibler projections for 
subspaces of the form ^ related to the adaptive lasso of Zou (2006). A similar result to the 
one below can be proved for some projections related to the non-negative garotte in view of 
the close connection between the adaptive lasso and the non- negative garotte (Zou, 2006). 
We will examine projections related to the non-negative garotte when we look at structured 
variable selection problems later. 

Considering only the case of a generalized linear model, the minimization problem above 
is equivalent to minimization of 

n p 

Mmif3s) + KWs))} + 7 E \(^s,\m (7) 
i=i j=i 

where there is a natural one to one correspondence between 7 and A in ([6]) and for simplicity 
we are considering the case where the observation specific weights (p/Ai are all equal. Actu- 
ally, as mentioned earlier, the relationship between 7 and 6 depends on /3, but this can be 
ignored in what follows: if we take A = ^A{f3^) + Op{l/y/n), where jj^A{(3^) is the number 
of the entries in A{f3^), the consistency result for ^ can be similarly established. We also 
consider henceforth the natural link function di{(3) = xjf3, although the arguments below 
extend easily to other link functions and the case of unequal weights for the observations. 

If /3 is a sample from the posterior distribution then under general conditions it is a root- 
n consistent estimator, and we know that for any e G [0, 1] there exists C not depending on 
n such that with Afn = {ex : \\cy. — f3^\\ < C j ^fn]^ Pr{f3 G Afn) > 1 — e where the probability 
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is calculated with respect to the distribution 



q{(3) = / p{l3\y)p{y\(3'')dy. 



We will show that 

lim Pr{A{P's) = ^(/3°) for every /3 e Xn) = I 
for a suitable sequence of values 7„ for 7 and hence 

lim Pr{A{l3's) = = 1- 

n— >oo 

That is, the posterior distribution of the projection indicates the correct model with prob- 
ability one as n — s> 00 for a suitable choice of the sequence of parameters 7^ defining the 
projection. 

We assume the following regularity conditions, which are the same as in Zou (2006). 

1. The Fisher information /(/3°) is positive definite; 

2. There is a large enough open set O containing the true parameter f3^ such that V/3 G O, 

\b"'{x'^f3)\ < M{x) < 00 

and 

E[M{x)\xiXjXk\] < 00 

for any k. 

Theorem 1. For (3 G A/'„,, if 7n/v^ ^^^1 7„ ^ 00 as n — > 00, (3'g which minimizes ([t]) 
is consistent in variable selection and is y^-consistent. 
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The proof, which is an easy adaptation of a similar result in Zou (2006), is given in the 
Appendix. 

5 Structured variable selection 

In the last section we considered variable selection in generalized linear models: as before, 
write rj — XI3 where rj is the vector of linear predictor values, X is a design matrix and 
/3 is a parameter vector. In this section we will combine the structured variable selection 
approach of Yuan, Joseph and Zou (2007) which uses the non-negative garotte of Breiman 
(1995) with our projection approach to variable selection. 

Consider the model in which r]^X((3*o0) where (3* = ((31, (3;f and 6> = (^i, Opf 
are p- vectors of parameters with 9j > 0, Y^^j=i ^ P ^-^^ where o denotes element by 
element multiplication of two vectors. We write (3* instead of /3 to emphasize that (3* is a 
different parameter in a different model to the original one, although our original model can 
be recovered by setting (3* = (3 and a p-vector of ones. We can consider the projection of 
this parameter onto the subspace 

S = 5(/3, A) = {(/3*, ^) : /3* = /3, 6, < A, > 0, j = 1, 

To calculate the projection we need to minimize — logp(/u(/3); /3o0) with respect to 6 subject 
to 9j > and Y^^=i ^ Gaussian case, this is just Breiman's non-negative garotte 

applied to the fitted values /i(/3) rather than the data y. The minimization problem is easily 
solved. See Yuan, Joseph and Zou (2007) for computational details in the slightly more 
complicated situation of structured fitting of generalized linear models. As pointed out by 
Zou (2006), the non-negative garotte is very closely related to the adaptive lasso. 

In solving the minimization problem above we may find that some of the 6j are zero. This 
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allows variable selection in the original model for which we have replaced the parameter /3 
with /3* o 6. Yuan, Joseph and Zou (2007) also suggested a way in which hierarchical 
structure can be incorporated in the non-negative garotte, and we make use of this idea 
here. Following their notation, for the ith predictor (corresponding to the ith column of 
X) we write Vi for the set of predictors which are so-called parents of i. Under the strong 
heredity principle (Chipman, 1996) all the predictors in Dj must be included in the model 
before the ith predictor is included. Under the weak heredity principle at least one of the 
predictors in T>i must be included before the ith predictor is included. Yuan, Joseph and Zou 
(2007) suggest the constraints Oi < 9j for j e Vi to enforce the strong heredity principle and 
9i < X^jgxij enforce the weak heredity principle. The linear nature of the constraints 
ensures that computations are still tractable. 



6 "Large p, small n" problems and preconditioning 

Paul et al. (2007) suggest that in "large p, small n" regression problems with more predictors 
than observations it is beneficial to separate the problem of obtaining good predictions from 
that of variable selection. With this in mind, they suggest a two step procedure where first 
a good predictor y is found for the mean response, and then in a second stage a model 
selection and fitting procedure such as the lasso is applied with the responses y replaced by 
y. They show that such a procedure can perform better than application of the fitting and 
selection procedure to the raw outcome y. 

We note that their second stage of fitting to a set of fitted values (using the lasso in 
the case of a linear model for instance) commonly corresponds to calculation of a KuUback- 
Leibler projection if y is obtained by plugging in of a point estimate of the model parameters. 
Our approach is related though slightly different - we generate from the posterior distribution 
of the parameters, and for each draw from the posterior we fit to the corresponding set of 
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fitted values. 



7 Examples and simulations 
7.1 Low birthweight data 

We consider application of our approach to the low birthweight data of Hosmer and Lemeshow 
(1989). The data are concerned with 189 births at a US hospital. We consider a logistic 
regression model for a response which is a binary indicator for birthweight being less than 
2.5kg. The predictors in the model are shown in Table 1. These predictors had been shown 
to be associated with low birthweight in past studies and it was desired to find out which of 
the predictors were important for the medical centre where the data were collected. In our 
analysis we leave the binary predictors unchanged, but centre and scale the other predictors 
to have mean zero and variance one. We fit the full model with a prior on the coefficients 
that is normal, with mean vector and covariance matrix 3/ where / denotes the identity 
matrix. This is a fairly noninformative prior on the scale of the probabilities: note that 
making the prior variances of coefficients very large would correspond to a very informative 
prior on the probability scale where high prior probability is placed on coefficient values 
corresponding to most of the fitted values being close to zero or one. Coefficient estimates 
(posterior means) and posterior standard deviations obtained by fitting the full model are 
shown in Table 2. These results were obtained using the MCMCpack package in R (Martin 
and Quinn, 2007). To obtain the results reported we ran the MCMC scheme for 1000 "burn 
in" and 10000 sampling iterations. 

We consider our projection approach to selection with the lasso type constraint ^ as 
well as the adaptive lasso type constraint (jsj). For each sample from the posterior, the 
whole solution path was calculated for the projection as the parameter A was varied, and 
all the distict models on the path were recorded. Thus for each sample from the posterior, 
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Figure 1: Plot of relative loss of explanatory power versus posterior expected model size for 
low birth weight example. The solid line is for the lasso projection and the dashed line is for 
the adaptive lasso. 




n I I I I r 

2 4 6 8 10 

Average model size 



we should have roughly 11 distinct models (including the null and the full models) because 
there are 10 covariates. We say "roughly" 11 distinct models because it is possible for a 
variable to leave the model as the regularization parameter is increased in the solution path, 
but this is not very common in practice. Table [3] shows the two most frequently appearing 
models of each size across solution paths for all samples from the posterior, together with 
the relative frequency with which this model appears amongst models with the same number 
of covariates. The table only reports results for the adaptive lasso. The reason why we only 
report results for the adaptive lasso is shown in Figure [1} which gives the relative loss of 
explanatory power as a function of the posterior expected number of variables selected in 
the projection. In the figure, the solid line is for the lasso projection and the dashed line 
for the adaptive lasso - it can be seen that for the adaptive lasso there is a reduced loss of 
explanatory power compared to the lasso for a given level of parsimony. 
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Examining the models in Table [3} the indicators for number of first trimester physician 
visits (ftv), one of the indicators for race and age appear to be the least important covariates. 
This is consistent with other published analyses of this data set such as in Venables and 
Ripley (2002). They consider stepwise variable selection using AIC in a main effects model 
including all the covariates, which results in exclusion of the dummy variables coding for ftv 
and age. They also consider inclusion of second order interactions and note that there is some 
evidence for an interaction between ftv and age. Raftery and Zheng (2003) also consider 
some reference Bayesian model averaging approaches to the analysis of this dataset. Their 
conclusions concerning the important variables (based on marginal posterior probabilities 
of inclusion) are similar to ours, although it should be noted that their model is different 
with first trimester physician visits treated as a continuous covariate rather than being coded 
through two indicator variables as in our analysis, which follows Venables and Ripley (2002). 

7.2 Structured variable selection example 

Our next example concerns a variable selection problem with hierarchical structure. The 
data are simulated following a similar example discussed in Yuan, Joseph and Zou (2007). 
The purpose of considering this example is to show that the method described in Section 5 
which incorporates hierarchical constraints into variable selection is beneficial. In particular, 
we consider a model satisfying the strong heredity principle, and then show that a projection 
approach to variable selection which imposes strong heredity outperforms an approach which 
does not impose this constraint. By outperforms here we mean that for a given level of 
parsimony (a given value for the posterior expected number of nonzero components of the 
projection) we have a greater posterior probability for the model chosen via the projection to 
encompass the true model, with a relatively small loss of explanatory power due to imposing 
the constraint. 

In the example of Yuan, Joseph and Zou (2007) three predictors Xi, X2 and X3 are 
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simulated following a multivariate normal distribution with mean zero and Cov{Xi, Xj) = 
pl*-il for values p of —0.5, and 0.5. There are n = 50 observations simulated and 100 
different datasets are considered for each value of p. We consider fitting a model that 
includes Xi, X2, X3 and all second order interaction terms (nine possible terms in all - no 
intercept is fitted). The true model used to generate Y is 

Y = 3X1 + 2X2 + 1.5X1X2 + e 

where e ~ X(0,9). Note that this model respects the strong heredity principle - for the 
interaction term, the corresponding main effects are also included. We use two variants of 
our non-negative garotte approach to fitting the data. The first variant respects the strong 
heredity principle, and the second variant does not impose any constraint. 

We considered a grid of 100 equally spaced values for A between and 9 in the constraint 
Yl^=i ^ as described in Section 5. Using the usual noninformative prior in the Bayesian 
linear model on the regression coefficients and variance parameter of a^) oc a^^, we 
can simulate directly from the posterior distribution without the need for iterative methods 
(see, for instance, Gelman et ai, 2003). For each simulated dataset we generated 1000 
samples from the posterior distribution. For each value of A in the grid and each draw 
from the posterior distribution, we calculated projections (with and without the strong 
heredity constraint) recording the number of active variables, whether or not the projection 
encompassed the true model and the KuUback-Leibler divergence between the full model and 
the projections. 

Plotting the posterior expected values of these quantities against one another for the 
grid of values of A gives a sense of the trade off between parsimony, predictive accuracy 
and identification of the important variables. Figure [2] shows plots of the probability of 
encompassing the true model and of the explanatory loss versus posterior expected number 
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of variables selected in the projected model for the two projection methods (imposing strong 
heredity, solid line, and no constraint, broken line). For a given level of parsimony it can 
be seen that the projection method which imposes the hierarchical constraint has a higher 
posterior probability of encompassing the true model so that enforcing the strong heredity 
principle when it is approporiate is helpful for obtaining more parsimonious models and for 
identifying the important variables. 

7.3 "Large small n" regression 

We now consider some simulations for the "large p, small n" case where there are more 
predictors than observations. We consider generating 100 datasets with n = 20 and 40 
predictors. The datasets follow a linear model 

2/ = X/3 + e 

where e ~ A^(0, 5^/). Below we write Xi, for the ith row of the design matrix X. 

1. Example 1: set I3j = 0,j = 1,...,10, j = 21,. ..,30, f3j = 2,j = 11,...,20, j = 31, ...,40. 
We have Xi, ~ N{0,I). 

2. Example 2: set /3j = 4:, j = 1,...,5, /3j = 0, j = 6,. ..,40. We generate Xi, - A^(0, S) 
with 'Ejj — 1, j — 1, ...,40 and = 0.5 i ^ j. 

In the first example there is no multicoUinearity, but 20 active predictors. In the second 
example there is moderate multicoUinearity but only 5 active predictors. 

Since the number of predictors is double the number of observations in both examples, 
here we are considering a "large p, small n" situation. For a Bayesian analysis of the data 
with an encompassing model we consider the Bayesian lasso of Park and Casella (2008). They 
consider the following priors on parameters. If an intercept term (3o is included, this is given 
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Figure 2: Plots of posterior probability of encompassing the true model versus average 
model size (left column) and explanatory loss versus average model size (right column). The 
parameter p takes values of —0.5 (top) (middle) and 0.5 (bottom). Solid line is for strong 
heredity and broken line no constraint. 
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a flat prior p{(3o) oc 1 and this parameter can be integrated out of tlie model analytically. 
Conditional on the variance cr^, the f3j are conditionally independent in their prior with 



where A > is a shrinkage parameter. Finally an inverse gamma prior can be used for o"^, 
where we use 1(7(0.01,0.01). A hyperprior can be placed on A, or it can be estimated by 
marginal maximum likelihood as outlined in Park and Casella (2008) or by cross-validation. 
For illustrative purposes here we will fix A = 10 in the computations below. Park and Casella 
(2008) outline an efficient MCMC scheme for computations. 

We consider projections based on the adaptive lasso, and Figure |3] shows a plot of the false 
discovery rate (average number of variables incorrectly selected divided by average number 
of variables selected) versus average model size. The averages are over 100 simulation repli- 
cates. Quite a large model would need to be chosen to encompass all the active predictors. 
Note that if we use the classical lasso to do selection then the number of predictors chosen 
by the projection cannot be more than the number of observations. Figure |3] also shows the 
explanatory loss as a function of the average model size. Model uncertainty is considerable 
here, and we believe that the distribution on the model space defined by the projection is 
extremely valuable for exploring model uncertainty. Figure |4] shows for the first simulation 
replicate in each example the marginal posterior probabilities of the variables being nonzero 
in the projection. The projections in the figure correspond to average model size of 13 (ex- 
ample 1) and 10 (example 2) corresponding to approximately 20% explanatory loss in both 
cases. The lines show the mean values for these probabilities within the active and inactive 
groups. It is clear that there is some useful information in the posterior distribution of the 
projection for distinguishing active from inactive variables. It is important to realize that 
Figure |4] is examining posterior probabilities of selection in the projection for a single repli- 
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cate, not the frequentist behaviour of selection across rephcates - such frequentist behaviour 
is summarized by the false discovery rates of Figure [3j We also stress that posterior prob- 
abilities of selection in the projection depend on the prior in the encompassing model and 
the tolerable explanatory loss. 

8 Conclusion 

We have discussed the use of Kullback-Leibler projections related to the lasso as a tool for 
the exploration of model uncertainty. There are many possible extensions to our suggested 
framework. One interesting possibility which we are currently pursuing is the use of pro- 
jections related to versions of the lasso for selection on batches of parameters and random 
effects. 

Appendix 

Proof of Theorem 1: We write (3 = (3^ + v / ^/n, where < C. Denoting u = y/n{f3g — f3^), 
we define 

i=l ^ ^ j=l ^ 

and Z{u) = L{u) — L{0). The minimizer u' of L{u) gives the minimizer (3'g of ([t]) by 
f3'g = (3^ + u' l^/n, and hence to study (3'g it suffices to consider u' . Following Zou (2006), 
we decompose Z{u) as 

Z{u) = Zi{u) + Z2{u) + Z^{u) + Zi{u) 



24 



Figure 3: Plots of false discovery rate versus average model size (top row) and explanatory 
loss versus average model size (bottom row) for examples 1 (left) and 2 (right). Plotted 
points correspond to a grid of values for the constraint A. 




Figure 4: Plots of marginal posterior probability of inclusion versus variable for first sim- 
ulation replicate for example 1 (left) and 2 (right). Probabilities are for projections with 
average model size 13 (left) and 10 (right) corresponding in both cases to approximately 
20% explanatory loss. The lines show the mean posterior probabilities of inclusion among 
the active and inactive groups. 
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where 



Z,(n) = -J][/..(/3)-6'(a.M^ 

1=1 ^ 

^2(n) = 5;V(a;r/3V^^n 



Z3H = iYl 



n 



n 
i=l 



where (3* lies between (3 and (3 +u/y/n. Since ijLi{(3 ) — b'{(3 ), we can write the first term 
as 



Z,{u) = Y.^nm - /^^(/3°))^ = + o,{l/Vn)W)'^ = alu + o,(l 

1=1 * 1=1 * * 
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where ||a„|| — Op{l). For the second term ^2(1*), we have 



i=l 

Thus ^2(1*) 1/2 u^I{^^)u. For the third term, following the arguments in Zou (2006), 
we have 

/JO^^O 



/9° = and Uj = 
00 /9° = and Uj ^ 



since 7^ satisfies 7„/ \/n — > and 7„ — cxd. The fourth term is of the order Op{l/ y/n) as 

n ^ 

6VnZi{u) < -M{xi)\xju\^ E[M {x)\x^ u\^] < 00. 



From the above arguments, we must have 



ua = Op{l) and u^c -^a 0, 



where is the subvector of u corresponding to the coefficients in A{0^). Thus 0g is y/n- 

consistent, and Vj G ^(/3°), with probability tending to one, /3° is estimated by a nonzero 
coefficient. 

It suffices then to show that V/c ^ A{0^), with probability tending to one, /3° will be 
estimated by zero. Otherwise, by the Karush-Kuhn- Tucker optimality conditions, wc must 
have 



1 



Y.xMfi)-h'{x^Ps)) 

* 1=1 



In 



rsgn(/35,fc)- 



(8) 
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It is easy to see that the left hand side is equivalent to 
1 " 

1 " 
=0,(1). 

However, the right hand side satisfies 

7n 

since = + 0,(1/^^^) = 0,(1/^^) for Pk ^ AT and /^^ ^ ^. This contradicts ^ and 
the proof is completed. 
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Table 1: Predictors for low birth weights data set 



Predictor 


Description 


age 


age of mother in years 


Iwt 


weight of mother (lbs) at least menstrual period 


raceblack 


indicator for race=black (0/1) 


raceother 


indicator for race other than white or black (0/1) 


smoke 


smoking status during pregnancy (0/1) 


ptd 


previous premature labors (0/1) 


ht 


history of hypertension (0/1) 


ui 


has uterine irritability (0/1) 


ftvl 


indicator for one physician visit in first trimester (0/1) 


ftv2+ 


indicator for two or more physician visits in first trimester (0/1) 



Table 2: Posterior means and standard deviations of coefficients for full model fitted to low 
birthweight data. 



Predictor 


Posterior 


Posterior 




Mean 


Standard 






Deviation 


age 


-0.21 


0.21 


Iwt 


-0.48 


0.22 


raceblack 


1.06 


0.52 


raceother 


0.65 


0.43 


smoke 


0.68 


0.41 


ptd 


1.31 


0.47 


ht 


1.69 


0.67 


ui 


0.64 


0.46 


ftvl 


-0.49 


0.46 


ftv2 


0.11 


0.44 
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Table 3: Two most frequently appearing models of each size in solution path for the pro- 
jection together with relative frequency of each model within all appearances of model of 
the same size (Prob/Size). Zeros and ones in the columns labelled by the predictors show 
inclusion and exclusion for different models (rows). 



Model 
Size 










Predictor 












Prob/ 
Size 




age 


Iwt 


black 


other 


smoke 


ptd 


ht 


ui 


ftvl 


ftv2 




1 

















1 














0.48 


1 





1 


























0.17 


2 





1 











1 














0.24 


2 

















1 


1 











0.10 


3 





1 











1 


1 











0.13 


3 





1 


1 








1 














0.06 


4 





1 








1 


1 


1 











0.07 


4 





1 











1 


1 





1 





0.06 


5 





1 


1 





1 


1 


1 











0.05 


5 





1 


1 








1 


1 


1 








0.05 


6 





1 


1 


1 


1 


1 


1 











0.06 


6 


1 


1 


1 





1 


1 


1 











0.05 


7 





1 


1 


1 


1 


1 


1 


1 








0.10 


7 


1 


1 


1 


1 


1 


1 


1 











0.09 


8 


1 


1 


1 


1 


1 


1 


1 


1 








0.13 


8 





1 


1 


1 


1 


1 


1 


1 


1 





0.13 


9 


1 


1 


1 


1 


1 


1 


1 


1 


1 





0.29 


9 


1 


1 


1 


1 


1 


1 


1 


1 





1 


0.19 
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